W12 lab assignment



In [1]:

    
import pandas as pd
from urllib.request import urlopen
import json
import warnings
warnings.filterwarnings("ignore")

Choropleth map

Let's make a choropleth map of Pokemon statistics. Each county's color should correspond to the number of Pokemon found there. You can download the data (pokemon.csv) from Canvas; it is a subset of the Pokemon data from Kaggle.

We'll also need an SVG map. You can download it from Wikipedia.

If you open the SVG with a text editor, you'll see many <path> tags. Each of these is a county. We want to change their style attributes, namely the fill color, so that the darkness of the fill corresponds to the number of Pokemon in each county.

Each path also has an id attribute, which holds the county's FIPS code. FIPS stands for Federal Information Processing Standard. Every county has a unique FIPS code, and it’s how we are going to associate each path with our Pokemon data.
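To make the structure concrete, here is a minimal sketch that reads a toy two-county SVG and lists each path's id and fill color. The toy shapes are made up for illustration, `parse_style` is a hypothetical helper, and the standard library's `xml.etree.ElementTree` stands in for the parser used later in this lab.

```python
import xml.etree.ElementTree as ET

# Toy SVG with two <path> elements. The shapes are invented;
# the ids are FIPS codes that appear elsewhere in this lab.
toy_svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <path id="12095" style="fill:#d0d0d0;stroke:#000000" d="M 0,0 L 1,0 L 1,1 Z"/>
  <path id="53033" style="fill:#d0d0d0;stroke:#000000" d="M 2,0 L 3,0 L 3,1 Z"/>
</svg>"""

SVG_NS = "{http://www.w3.org/2000/svg}"


def parse_style(style):
    """Split a 'key:value;key:value' style string into a dict."""
    return dict(item.split(":", 1) for item in style.split(";") if item)


root = ET.fromstring(toy_svg)
county_fill = {
    path.get("id"): parse_style(path.get("style"))["fill"]
    for path in root.iter(SVG_NS + "path")
}
print(county_fill)
```

Each key in `county_fill` is a FIPS code, which is what lets us join the map to a table of per-county counts.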

For this we first need to do some data cleaning.



In [2]:

    
pokemon = pd.read_csv('pokemon.csv')
pokemon.head()









    Out[2]:

   pokemonId   latitude   longitude
0         16  20.525745  -97.460829
1        133  20.523695  -97.461167
2         16  38.903590  -77.199780
3         13  47.665903 -122.312561
4        133  47.666454 -122.311628

The data contains only latitude and longitude. To convert these to a FIPS code, we need reverse geocoding. The Federal Communications Commission provides an API for such tasks.

The API works through an HTTP request, so we can use Python's urllib library to handle it. For example:



In [3]:

    
res = urlopen("http://data.fcc.gov/api/block/find?format=json&latitude=28.35975&longitude=-81.421988").read().decode('utf-8')
res









    Out[3]:





'{"messages":["FCC0001: The coordinate lies on the boundary of mulitple blocks, first FIPS is displayed. For a complete list use showall=true to display \'intersection\' element in the Block"],"Block":{"FIPS":"120950170151016"},"County":{"FIPS":"12095","name":"Orange"},"State":{"FIPS":"12","code":"FL","name":"Florida"},"status":"OK","executionTime":"74"}'
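The query string in the request above can also be assembled from its parts rather than typed by hand. The following is a minimal sketch using `urllib.parse.urlencode`; `build_fcc_url` is a hypothetical helper name, and the endpoint is copied from the example request.

```python
from urllib.parse import urlencode

# Endpoint copied from the example request in this lab.
FCC_ENDPOINT = "http://data.fcc.gov/api/block/find"


def build_fcc_url(latitude, longitude):
    """Return the block-lookup URL for one coordinate pair (hypothetical helper)."""
    query = urlencode({"format": "json", "latitude": latitude, "longitude": longitude})
    return FCC_ENDPOINT + "?" + query


print(build_fcc_url(28.35975, -81.421988))
```

Building the URL this way keeps the parameter names in one place, which helps once the same request has to be issued for every row.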

The result comes back as a JSON string, so we need to parse it with Python's json module.



In [4]:

    
json.loads(res)









    Out[4]:





{'Block': {'FIPS': '120950170151016'},
 'County': {'FIPS': '12095', 'name': 'Orange'},
 'State': {'FIPS': '12', 'code': 'FL', 'name': 'Florida'},
 'executionTime': '74',
 'messages': ["FCC0001: The coordinate lies on the boundary of mulitple blocks, first FIPS is displayed. For a complete list use showall=true to display 'intersection' element in the Block"],
 'status': 'OK'}

Now we can access it as a dictionary and get the county's FIPS code.



In [5]:

    
json.loads(res)['County']['FIPS']









    Out[5]:





'12095'
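Not every coordinate falls inside a US county, so it can be useful to extract the code defensively. The sketch below wraps the parsing step in a hypothetical helper, `county_fips`, that returns None when the county code is missing. The shape of the second response is an assumption made for illustration, not output copied from the API.

```python
import json


def county_fips(raw_response):
    """Return the county FIPS code from a raw JSON response, or None if absent."""
    data = json.loads(raw_response)
    county = data.get("County") or {}
    return county.get("FIPS")


# Trimmed version of the response shown above.
inside_us = '{"County": {"FIPS": "12095", "name": "Orange"}, "status": "OK"}'
# Assumed response shape for a point with no county (illustrative only).
no_county = '{"County": {"FIPS": null, "name": null}, "status": "OK"}'

print(county_fips(inside_us))
print(county_fips(no_county))
```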

We can do this for every row in the dataframe. The pandas apply method is a convenient tool here: you write a function and apply it to each row of the dataframe.



In [6]:

    
# TODO: create a column in the dataframe called 'FIPS' for the FIPS codes.
# You should have the dataframe look like the following.
# Note that looking up all the lat-lon pairs may take some time.
def get_fips(row):
    # Query the FCC API for one row's coordinates and return the county FIPS code.
    url = ("http://data.fcc.gov/api/block/find?format=json"
           "&latitude=" + str(row['latitude']) +
           "&longitude=" + str(row['longitude']))
    res = urlopen(url).read().decode('utf-8')
    return json.loads(res)['County']['FIPS']

# axis=1 passes each row to get_fips, so this makes one HTTP request per row.
pokemon['FIPS'] = pokemon.apply(get_fips, axis=1)
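Because every row triggers its own HTTP request, one possible way to save time is to cache results so that a repeated coordinate pair is looked up only once. The sketch below shows the idea with `functools.lru_cache`. `fake_lookup` is a stand-in for the real request and always returns the same code, and whether the data actually contains repeated coordinates is an assumption.

```python
from functools import lru_cache

calls = []  # records every simulated request so we can count them


def fake_lookup(latitude, longitude):
    """Stand-in for the HTTP request (hypothetical; returns a fixed code)."""
    calls.append((latitude, longitude))
    return "53033"


@lru_cache(maxsize=None)
def cached_fips(latitude, longitude):
    """Look up a coordinate pair, reusing the result for repeated pairs."""
    return fake_lookup(latitude, longitude)


coords = [
    (47.665903, -122.312561),
    (47.666454, -122.311628),
    (47.665903, -122.312561),  # repeat of the first pair
]
fips = [cached_fips(lat, lon) for lat, lon in coords]
print(fips, len(calls))
```

Three coordinates are requested, but only the two distinct pairs reach `fake_lookup`.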



In [7]:

    
pokemon.head()









    Out[7]:

   pokemonId   latitude   longitude   FIPS
0         16  20.525745  -97.460829   None
1        133  20.523695  -97.461167   None
2         16  38.903590  -77.199780  51059
3         13  47.665903 -122.312561  53033
4        133  47.666454 -122.311628  53033

We want to color the counties by the number of Pokemon appearing in them, so all we need now is a table with each county's FIPS code and the number of Pokemon found there.



In [8]:

    
# Count rows per county; reset_index(name=...) names the count column directly.
pokemon_density = pokemon.groupby('FIPS').size().reset_index(name='Count')



In [9]:

    
pokemon_density.head()
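To see what the grouping step computes, here is a small standard-library sketch that counts the FIPS values from the five rows shown earlier. `collections.Counter` is used purely for illustration; rows with no FIPS code are filtered out explicitly, mirroring the way pandas groupby drops missing keys by default.

```python
from collections import Counter

# FIPS column from the first five rows of the dataframe shown earlier.
fips_column = [None, None, "51059", "53033", "53033"]

# Skip rows with no county, then count occurrences of each code.
counts = Counter(code for code in fips_column if code is not None)
table = sorted(counts.items())
print(table)
```

Each pair in `table` corresponds to one row of the FIPS/Count table: a county code and the number of sightings in it.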

Now we can turn to our SVG file. We want to find the path for each county, and with over 3000 counties we need to do this programmatically. For this, we can use the BeautifulSoup package, which specializes in parsing HTML and XML. SVG files are essentially XML, so they can be handled in the same way.



In [10]:

    
from bs4 import BeautifulSoup

Read in the svg



In [11]:

    
with open('USA_Counties_with_FIPS_and_names.svg', 'r') as f:
    svg = f.read()

Load it with BeautifulSoup



In [12]:

    
soup = BeautifulSoup(svg)

BeautifulSoup's findAll() method returns every tag with the given name.



In [13]:

    
paths = soup.findAll('path')



In [14]:

    
paths[0]









    Out[14]:





<path d="M 62.678745,259.31235 L 63.560745,258.43135 L 64.220745,257.99135 L 64.439745,258.43135 L 64.000745,258.65135 L 64.439745,258.65135 L 66.643745,257.99135 L 68.626745,255.56635 L 70.388745,256.44835 L 70.388745,256.89035 L 69.727745,257.54935 L 69.727745,258.21235 L 70.388745,257.99135 L 70.829745,256.89035 L 71.269745,256.44835 L 71.930745,257.10835 L 72.150745,257.99135 L 72.811745,258.21235 L 73.030745,257.77135 L 74.131745,257.54935 L 75.894745,257.54935 L 76.113745,257.77135 L 75.673745,258.43135 L 75.673745,258.65135 L 76.996745,258.87235 L 76.774745,259.53235 L 77.656745,259.53235 L 78.757745,258.87235 L 81.180745,258.65135 L 82.722745,259.09235 L 83.386745,259.09235 L 84.044745,259.31235 L 84.267745,259.53235 L 85.148745,259.53235 L 86.249745,259.31235 L 87.572745,259.31235 L 89.114745,259.75435 L 89.554745,259.53235 L 90.436745,258.87235 L 90.655745,258.65135 L 91.096745,258.21235 L 92.639745,258.43135 L 96.163745,259.53235 L 97.264745,263.05835 L 97.925745,265.26135 L 88.893745,267.46435 L 89.334745,269.88635 L 87.572745,270.32735 L 82.945745,271.21135 L 82.722745,271.21135 L 72.371745,272.31135 L 69.947745,272.31135 L 69.947745,271.87035 L 68.186745,271.87035 L 68.186745,271.42935 L 66.423745,271.42935 L 64.661745,271.64935 L 63.338745,271.64935 L 63.338745,271.21135 L 62.678745,271.21135 L 62.678745,271.42935 L 60.696745,271.42935 L 60.255745,271.21135 L 60.034745,271.21135 L 60.034745,271.42935 L 59.154745,271.42935 L 58.932745,270.98935 L 57.831745,270.98935 L 57.831745,271.42935 L 57.389745,271.42935 L 54.304745,271.21135 L 54.304745,272.08935 L 52.762745,272.08935 L 51.441745,271.42935 L 50.780745,270.54735 L 51.220745,269.22735 L 51.441745,267.68335 L 52.983745,267.90535 L 54.967745,267.68335 L 55.626745,267.46435 L 56.948745,265.92135 L 57.611745,263.93935 L 58.932745,261.95735 L 59.814745,261.07435 L 60.474745,261.29735 L 61.356745,260.85535 L 62.678745,259.31235" id="02185" inkscape:label="North Slope, AK" 
style="font-size:12px;fill:#d0d0d0;fill-rule:nonzero;stroke:#000000;stroke-opacity:1;stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;marker-start:none;stroke-linejoin:bevel"></path>

We should also decide on the colors. ColorBrewer provides some nice palettes. Pick one of the sequential palettes and put its hexadecimal color codes into a list.



In [15]:

    
colors = ['#fef0d9', '#fdd49e', '#fdbb84','#fc8d59','#e34a33','#b30000']



In [16]:

    
# TODO: substitute the above with a palette of your choice.
colors = ['#f0f9e8','#bae4bc','#7bccc4','#43a2ca','#0868ac']

Now we’re going to change the style attribute for each path in the SVG. We’re just interested in fill color, but to make things easier we’re going to replace the entire style instead of parsing to replace only the color. Define the style as the following:



In [17]:

    
path_style = 'font-size:12px;fill-rule:nonzero;stroke:#000000;stroke-opacity:1;\
stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt;\
marker-start:none;stroke-linejoin:bevel'
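For comparison, the alternative of replacing only the fill could look roughly like the sketch below. `set_fill` is a hypothetical helper, not part of the lab's solution, and the sample style is a shortened version of the one on the first path.

```python
def set_fill(style, color):
    """Return the style string with its fill replaced (or added), keeping the rest."""
    # Drop any existing 'fill:' entry; 'fill-rule:' is kept because it does
    # not start with 'fill:'.
    props = [item for item in style.split(";") if item and not item.startswith("fill:")]
    props.append("fill:" + color)
    return ";".join(props)


original = "font-size:12px;fill:#d0d0d0;fill-rule:nonzero;stroke:#000000"
print(set_fill(original, "#0868ac"))
```

Replacing the whole style, as this lab does, avoids this parsing step at the cost of overwriting any per-path styling.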



In [18]:

    
for p in paths:
    # Look up this county's count; skip paths that have no id or no data.
    match = pokemon_density.loc[pokemon_density['FIPS'] == p.get('id'), 'Count']
    if match.empty:
        continue
    cnt = int(match.iloc[0])

    # TODO: decide color classes
    if cnt > 20:
        color_class = 4
    elif cnt > 15:
        color_class = 3
    elif cnt > 10:
        color_class = 2
    elif cnt > 5:
        color_class = 1
    else:
        color_class = 0

    color = colors[color_class]
    p['style'] = path_style + ';fill:' + color

Based on the number of Pokemon, we want to assign each county to a color class. For example, if number > 50, use color 1; if 40 < number <= 50, use color 2; and so on.
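The chain of comparisons can also be expressed as a lookup against a sorted list of thresholds. This is a minimal sketch using `bisect_left` with the same cut points as the loop above (5, 10, 15, 20); `upper_bounds` and `color_for` are hypothetical names introduced here.

```python
from bisect import bisect_left

colors = ['#f0f9e8', '#bae4bc', '#7bccc4', '#43a2ca', '#0868ac']

# Inclusive upper bound of each class except the last:
# class 0 is count <= 5, class 1 is 5 < count <= 10, ..., class 4 is count > 20.
upper_bounds = [5, 10, 15, 20]


def color_for(count):
    """Map a count to a color by finding which interval it falls in."""
    return colors[bisect_left(upper_bounds, count)]


for count in (1, 5, 6, 20, 21):
    print(count, color_for(count))
```

With this form, changing the classes means editing one list, and the number of thresholds must stay one less than the number of colors.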

Remember that we saved the svg in the soup object. Now that we have changed the svg to fill with colors, we can just write it out as a new file.



In [19]:

    
with open('svg_colored.svg', 'w') as g:
    g.write(soup.prettify())

Open the new SVG in your browser. You'll notice that only a few counties are colored. This is partly because we're using only a subset of the original data: the complete data has 296021 rows, and looking up every FIPS code would take too much time in class. If you're interested, you can download the full data and make a complete map.
